Exploratory data analysis of red wine by Charles Brands

In this analysis we look into the effect of different chemical properties on wine quality. The structure of the dataset is shown below. It has 12 variables: the quality and 11 chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol). The quality is a discrete value; the chemical properties are continuous. The dataset has 1599 observations.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Plots Section

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As the table and plot above show, there are no wines in this dataset with a quality below 3 or above 8. Most wines (1319 of 1599) are either a 5 or a 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The quality of the wines runs from 3 to 8. The mean is 5.6 and the median is 6.0.
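As a cross-check, the mean, the median, and the share of wines scored 5 or 6 can be recomputed from the frequency table alone. The following is a minimal sketch in Python (the output in this report comes from R); the variable names are illustrative, and only the counts printed above are used.

```python
from statistics import mean, median

# Quality counts copied from the frequency table above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

# Expand the table back into one score per wine.
scores = [q for q, n in quality_counts.items() for _ in range(n)]

n_wines = len(scores)
mean_quality = mean(scores)
median_quality = median(scores)
share_5_or_6 = (quality_counts[5] + quality_counts[6]) / n_wines

print(n_wines)                 # 1599
print(round(mean_quality, 3))  # 5.636
print(median_quality)          # 6
print(round(share_5_or_6, 3))  # 0.825
```

About 82% of the wines fall in the two middle scores, which is why the quartiles at 5 and 6 sit so close together.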

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity has a right-tailed distribution. The median is 7.9 g/dm^3. The outliers on the right pull the mean up to 8.3 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile acidity is approximately normally distributed with a few outliers to the right. The median and mean are close together at 0.5200 and 0.5278 g/dm^3 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric acid distribution is unusual. It looks like two right-skewed distributions on top of one another.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual sugar distribution has a sharp peak at around 2.2 g/dm^3. Although the tail is long, the peak is so far above the rest that the mean (2.54 g/dm^3) is pulled only a little to the right of the median (2.2 g/dm^3). The boxplot is also rather flat, as the number of wines with a residual sugar around 2.2 g/dm^3 is huge compared to the rest.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chlorides distribution is also sharply peaked. This time the peak is around 0.079 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free sulfur dioxide distribution has a tail on the right. The median and mean are 14.00 and 15.87 mg/dm^3 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The total sulfur dioxide distribution is right-tailed with some outliers very far to the right. The median and the mean are 38.00 and 46.47 mg/dm^3 respectively, with a maximum of 289 mg/dm^3.
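How far "very far to the right" is can be judged with the usual 1.5 × IQR rule that boxplots conventionally use for their whiskers. The sketch below, in Python, applies that rule to the quartiles printed in the summaries above; the function name `upper_fence` is my own, and the rule is a convention rather than something this report states it used.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence: values above it are conventionally flagged as outliers."""
    return q3 + k * (q3 - q1)

# (1st quartile, 3rd quartile, maximum) copied from the summaries above.
summaries = {
    "residual.sugar": (1.9, 2.6, 15.5),
    "free.sulfur.dioxide": (7.0, 21.0, 72.0),
    "total.sulfur.dioxide": (22.0, 62.0, 289.0),
}

for name, (q1, q3, maximum) in summaries.items():
    fence = upper_fence(q1, q3)
    print(f"{name}: fence={fence:.2f}, max={maximum}, beyond fence: {maximum > fence}")
```

For total sulfur dioxide the fence is 62 + 1.5 × 40 = 122 mg/dm^3, so the maximum of 289 mg/dm^3 lies well beyond it.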

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The density is approximately normally distributed with a mean of 0.9967 g/cm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH is approximately normally distributed with a mean of 3.311.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates distribution is right-skewed with some outliers far to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol percentage distribution is right-skewed with a median of 10.20% and a mean of 10.42%.
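The descriptions above repeatedly compare the mean with the median to judge skew. A rough way to put all eleven properties on one scale is to express that gap in units of the interquartile range. This is a heuristic of my own for illustration (in Python, not a standard skewness statistic and not part of the original analysis), using only the summary values printed above.

```python
# (1st quartile, median, mean, 3rd quartile) copied from the summaries above.
summaries = {
    "fixed.acidity": (7.10, 7.90, 8.32, 9.20),
    "volatile.acidity": (0.39, 0.52, 0.5278, 0.64),
    "citric.acid": (0.09, 0.26, 0.271, 0.42),
    "residual.sugar": (1.9, 2.2, 2.539, 2.6),
    "chlorides": (0.07, 0.079, 0.08747, 0.09),
    "free.sulfur.dioxide": (7.0, 14.0, 15.87, 21.0),
    "total.sulfur.dioxide": (22.0, 38.0, 46.47, 62.0),
    "density": (0.9956, 0.9968, 0.9967, 0.9978),
    "pH": (3.21, 3.31, 3.311, 3.40),
    "sulphates": (0.55, 0.62, 0.6581, 0.73),
    "alcohol": (9.5, 10.2, 10.42, 11.1),
}

def mean_median_gap(q1, median, mean, q3):
    """Gap between mean and median in units of the interquartile range."""
    return (mean - median) / (q3 - q1)

gaps = {name: mean_median_gap(*values) for name, values in summaries.items()}

# Largest positive gap first: the mean sits furthest to the right of the median.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {gap:+.2f}")
```

On these numbers residual sugar and chlorides show the largest gaps, because their middle halves are so narrow, while volatile acidity, citric acid, pH, and density stay close to zero. Note that the bimodal shape of citric acid is invisible to a single number like this.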

Univariate Analysis

What is the structure of your dataset?

The red wine dataset originally had 1599 rows and 13 columns. I deleted the column X, as it is essentially the same as the row number, which left 12 columns: the quality and 11 chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol). The quality is discrete, with values 3, 4, 5, 6, 7, and 8; the rest are continuous.

What is/are the main feature(s) of interest in your dataset?

The most important feature of the dataset is quality. I am interested in how the quality is affected by the other properties of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

At this point we have not yet analyzed the data. As I don’t drink alcohol and have never tasted wine, I am guessing here. As people in general seem to like sugar and alcohol, I would expect those properties to have a positive impact on the quality. I would expect sulfur dioxide, sulphates, and chlorides to have a negative effect.

Did you create any new variables from existing variables in the dataset?

No. I did delete the column X, as the sample number and the row number are the same.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The citric acid distribution is very unusual. It looks like two right-skewed distributions on top of one another. Other than removing the column X as described above, I did not change the data.

Bivariate Plots Section

I am interested in the chemical properties that have an effect on the quality of the wine. I intend to plot the quality with all the available properties but first let us make a correlation table to get a first idea of which properties have the strongest effect on the quality of the wine.

## [1] "Correlation of wine quality with different properties"
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol 
##           0.25139708           0.47616632

We see that alcohol, sulphates, citric acid, and fixed acidity have a positive correlation with the wine quality. Volatile acidity, chlorides, total sulfur dioxide, and density have a negative correlation. The remaining properties (residual sugar, free sulfur dioxide, and pH) have correlations close to zero.
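The correlation table is easier to read when sorted by strength. The following Python sketch ranks the printed coefficients by absolute value and adds r^2 for scale; it assumes the coefficients are Pearson correlations (the default of R's `cor()`), which the output above does not state explicitly.

```python
# Correlations with quality, copied from the table above.
correlations = {
    "fixed.acidity": 0.12405165,
    "volatile.acidity": -0.39055778,
    "citric.acid": 0.22637251,
    "residual.sugar": 0.01373164,
    "chlorides": -0.12890656,
    "free.sulfur.dioxide": -0.05065606,
    "total.sulfur.dioxide": -0.18510029,
    "density": -0.17491923,
    "pH": -0.05773139,
    "sulphates": 0.25139708,
    "alcohol": 0.47616632,
}

# Sort by absolute value; the sign only tells the direction of the relation.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    # For a Pearson r, r squared is the share of the variance in quality
    # that a straight-line fit on this single property would account for.
    print(f"{name:22s} r={r:+.3f}  r^2={r * r:.3f}")
```

Alcohol comes first and volatile acidity second, but even alcohol has an r^2 of only about 0.23, so no single property comes close to determining quality on its own.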

In this plot we see that fixed acidity has almost no effect on the wine quality.

This plot clearly shows that increasing the volatile acidity degrades the wine quality.

Increasing the citric acid concentration increases the quality of the wine.

Residual sugar seems to have no effect on the wine quality.

Chlorides have a negative impact on the wine quality.

Free sulfur dioxide is mostly present in the wines of average quality. Both the good and the terrible wines have a lower free sulfur dioxide concentration.

Total sulfur dioxide is mostly present in the wines of average quality. Both the good and the terrible wines have a lower total sulfur dioxide concentration.

Increasing the density lowers the quality of the wine.

Lowering the pH has a positive impact on the quality of the wine.

Increasing sulphates concentration increases the quality of the wine.

Increasing the alcohol concentration increases the wine quality.

We can see that increasing the alcohol percentage lowers the density. This is not surprising, as alcohol has a lower density than water. It can explain why wine quality increases as the density is lowered. The other properties that had an effect on wine quality did not have a clear relation with alcohol percentage.
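The direction of this relation follows from simple arithmetic. The sketch below, in Python, treats wine as an ideal ethanol and water mixture by volume, using densities at 20 °C of about 0.789 g/cm^3 for ethanol and 0.998 g/cm^3 for water. It is an assumption-laden illustration: it ignores dissolved sugars and acids as well as the volume contraction on mixing, so its absolute values will not match the densities in this dataset; only the downward trend is the point.

```python
ETHANOL_DENSITY = 0.789  # g/cm^3 at 20 degrees C
WATER_DENSITY = 0.998    # g/cm^3 at 20 degrees C

def mixture_density(alcohol_percent):
    """Volume-weighted density of an idealized ethanol and water mixture."""
    fraction = alcohol_percent / 100.0
    return (1.0 - fraction) * WATER_DENSITY + fraction * ETHANOL_DENSITY

# Alcohol percentages spanning roughly the range seen in this dataset.
for percent in (8.4, 10.2, 14.9):
    print(f"{percent:5.1f}% alcohol -> {mixture_density(percent):.4f} g/cm^3")
```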

We see that increasing the citric acid concentration lowers the pH value, which is not surprising.

Increasing the fixed acidity lowers the pH.

Volatile acidity has little effect on the pH.

##       cor 
## 0.6676665

The correlation between total sulfur dioxide and free sulfur dioxide is 0.668, the strongest I found in this dataset.
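For readers who want to see what goes into such a coefficient, here is a minimal Pearson correlation in Python. It is run on only the first ten observations shown in the structure listing at the top, so it will not reproduce the 0.668 obtained on all 1599 wines; the function and variable names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the common 1/(n-1) factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# First ten observations only, copied from the structure listing at the top.
free_so2 = [11, 25, 15, 17, 11, 13, 15, 15, 9, 17]
total_so2 = [34, 67, 54, 60, 34, 40, 59, 21, 18, 102]

r_first_ten = pearson(free_so2, total_so2)
print(round(r_first_ten, 3))
```

A positive relation is expected here by construction, since free sulfur dioxide is one component of total sulfur dioxide.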

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The strongest positive correlation with quality is the alcohol percentage. Sulphates and citric acid also had positive effects on the wine quality. The wine quality was negatively influenced by volatile acidity, chlorides, and density. The influence of density could be explained by its negative correlation with alcohol. Contrary to my expectation, residual sugar levels had little to no effect on the wine quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Fixed acidity and citric acid had a strong negative effect on the pH, while volatile acidity had little effect on it. Yet fixed acidity had almost no effect on the quality of the wine. So apparently the quality of the wine is not determined by its overall acidity but is influenced more by the presence of citric acid, which has a positive effect on quality, and of volatile acidity, which has a negative effect.

What was the strongest relationship you found?

The relation between total sulfur dioxide and free sulfur dioxide was the strongest found, with a correlation coefficient of 0.668.

Multivariate Plots Section

Higher quality wines are produced with a higher alcohol content and a higher sulphates concentration.

In the bivariate section we saw that increasing the density lowers the quality of the wine. In this plot we see that this is mostly because a lower density means a higher alcohol concentration (alcohol is lighter than water). The better wine quality goes with the higher alcohol percentage, not with the lower density itself.

Higher volatile acidity lowers the wine quality. The pH itself has little effect.

More citric acid results in better wines. Again the influence of pH is small at best.

This plot shows the combined effect of higher citric acid and lower volatile acidity. With the exception of a few outliers the best wines are at the bottom right of this plot.

This plot shows that good wines are produced by higher alcohol and lower volatile acidity. It also shows that the effect of alcohol is larger than the effect of volatile acidity. Below an alcohol percentage of 10% it is very hard to produce a good wine.

The best wines are produced by a high alcohol percentage and a larger citric acid concentration. Again the effect of alcohol is the strongest.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The best wines are produced by a higher alcohol percentage, a higher sulphates concentration (0.8 to 1.1 g/dm^3), a higher citric acid concentration, and a low volatile acidity.

Were there any interesting or surprising interactions between features?

Residual sugar had almost no effect on the wine quality. Wine quality was not much influenced by the pH itself, but more by the presence of citric acid and the absence of volatile acidity.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Description One

This report is about wine quality, so the first plot I have chosen is the wine quality distribution. The distribution appears to be roughly normal, with most wines having a quality of 5 or 6. The mean is 5.6. The scale runs from 0 to 10, but no wines with a quality below 3 or above 8 were found in this dataset.

Plot Two

Description Two

The property with the largest effect on wine quality is alcohol percentage. A higher alcohol percentage gives better wines. However, as the overlap in the boxplots shows, alcohol percentage alone is not enough to guarantee a good wine.

Plot Three

Description Three

After alcohol, the strongest effect on wine quality comes from volatile acidity, or rather the lack of it. In general, the lower the volatile acidity, the better the wine. In this plot the best wines are in the lower right corner, while the terrible wines are found in the upper left corner. It is also clear that the effect of alcohol percentage is stronger: below 10% there are hardly any good wines.


Reflection

This data set contains information on 1599 red wines with twelve variables: the quality and eleven chemical properties. The quality is discrete and the rest are continuous. The quality is scored on a scale from 0 to 10, but no wines with a quality below 3 or above 8 were found in this dataset. This study could be improved by gathering more data from wines with very low or very high quality scores.

Wine quality improved with increasing alcohol percentage and citric acid concentration. Increasing the volatile acidity degrades the wine quality.

I was surprised to find that the residual sugar concentration had almost no influence on the wine quality. Clearly not everything gets better from adding sugar!

I was also surprised to find that increasing the sulphates concentration from 0.5 to 0.8 g/dm^3 improved the quality. Adding more sulphates beyond that did not seem to help.